Abstract

In this paper we consider the training of single hidden layer neural
networks by pseudoinversion, which, in spite of its popularity, is sometimes
affected by numerical instability issues. Regularization is known to be
effective in such cases, so we introduce, in the framework of Tikhonov
regularization, a matricial reformulation of the problem which allows us to
use the condition number as a diagnostic tool for identification of
instability. By imposing well-conditioning requirements on the relevant
matrices, our theoretical analysis allows the identification of an optimal
value for the regularization parameter from the standpoint of stability. We
compare it with the value derived by cross-validation for overfitting control
and optimisation of the generalization performance. We test our method for
both regression and classification tasks. The proposed method is quite
effective in terms of predictivity, often with some improvement on
performance with respect to the reference cases considered. This approach,
due to the analytical determination of the regularization parameter,
dramatically reduces the computational load required by many other
techniques.

Keywords: Regularization parameter, Condition number, Pseudoinversion,
Numerical instability


1. Introduction 

In past decades Single Layer Feedforward Neural Networks (SLFN) training
was mainly accomplished by iterative algorithms involving the repetition
of learning steps aimed at minimising the error functional over the space of
network parameters. These techniques often resulted in slow and
computationally expensive methods.

Researchers have therefore always been motivated to explore alternative
algorithms, and recently some new techniques based on matrix inversion have
been developed. In the literature, they were initially employed to train radial
basis function neural networks (Poggio and Girosi, 1990a); the idea of using
them also for different neural architectures was suggested for instance in
(Cancelliere, 2001).

The work by Huang et al. (see for instance (Huang et al., 2006)) gave rise
to a great interest in the neural network community: they presented the
technique of Extreme Learning Machine (ELM), for which SLFNs with randomly
chosen input weights and hidden layer biases can learn sets of observations
with a desired precision, provided that activation functions in the hidden
layer are infinitely differentiable. Besides, because of the use of linear
output neurons, output weights determination can be brought back to the
solution of a linear system, obtained via the Moore-Penrose generalised
inverse (or pseudoinverse) of the hidden layer output matrix; in this way
iterative training is no longer required.

However, such techniques appear to require more hidden units than
conventional neural network training algorithms to achieve comparable
accuracy, as discussed by Yu and Deng (Yu and Deng, 2012).

Many application-oriented studies in recent years have been devoted
to the use of these single-pass techniques, which are easy to implement and
computationally fast; some are described e.g. in (Nguyen et al., 2010; Kohno
et al., 2010; Ajorloo et al., 2007). A yearly conference is currently being
held on the subject, the International Conference on Extreme Learning
Machines, and the method is currently dealt with in some journal special
issues, e.g. Soft Computing (Wang et al., 2012) and the International Journal
of Uncertainty, Fuzziness and Knowledge-Based Systems (Wang, 2013).

Because of the possible presence of singular and almost singular matrices,
pseudoinversion is known to be a powerful but numerically unstable method;
nonetheless, in the neural network community it is often used without
singularity checks and evaluated through approximated methods.

In this paper we improve on the theoretical framework using singular
value analysis to detect the occurrence of instability. Building on Tikhonov
regularization, which is known to be effective in this context (Golub et al.,
1999), we present a technique, named Optimally Conditioned Regularization
for Pseudoinversion (OCReP), that replaces unstable, ill-posed problems with
well-posed ones.

Our approach is based on the formal definition of a new matricial
formulation that allows the use of the condition number as a diagnostic tool.
In this context an optimal value for the regularization parameter is
analytically derived by imposing well-conditioning requirements on the
relevant matrices.

The issue of regularization parameter choice has often been identified as
crucial in the literature, and dealt with in a number of historical
contributions: a conservative guess might put its published estimates at
several dozens. Some of the most relevant works are mentioned in section 2,
where the related theoretical background is recalled.

Its determination, mainly aimed at overfitting control, has often been
done either experimentally via cross-validation, requiring heavy
computational training procedures, or analytically under specific conditions
on the matrices involved, sometimes hardly applicable to real datasets, as
discussed in section 2.

In section 3 we present the basic concepts concerning input and output
weights setting, and we recall the main ideas on ill-posedness, regularization
and condition number.

In section 4 our matricial framework is introduced, and constraints on
the condition number are imposed in order to derive the optimal value for the
regularization parameter.

In section 5 our diagnosis and control tool is tested on some applications
selected from the UCI database and validated by comparison with the
framework regularized via cross-validation and with the unregularized one.

The same datasets are used in section 6 to test the effectiveness of the
technique: our performance is compared with those obtained in other
regularized frameworks, originated in both statistical and neural domains.

2. Recap on ordinary least-squares and ridge regression estimators

As stated in the introduction, pseudoinversion based neural training brings
output weights determination back to the solution of a linear system: in this
section we recall some general ideas on this issue, which in the next sections
will be specialized to deal with SLFN training.

The estimate of $\beta$ through the ordinary least-squares (OLS) technique is
a classical tool for solving the problem

$$Y = X\beta + \epsilon, \tag{1}$$

where $Y$ and $\epsilon$ are column $n$-vectors, $\beta$ is a column $p$-vector
and $X$ is an $n \times p$ matrix; $\epsilon$ is random, with expectation value
zero and variance $\sigma^2$.

In (Hoerl, 1962) and (Hoerl and Kennard, 1970) the role of the ordinary ridge
regression (ORR) estimator $\hat\beta(\lambda)$ as an alternative to the OLS
estimator in the presence of multicollinearity is deeply analyzed. In
statistics, multicollinearity (also collinearity) is a phenomenon in which two
or more predictor variables in a multiple regression model are highly
correlated, meaning that one can be linearly predicted from the others with a
non-trivial degree of accuracy. In this situation, the coefficient estimates
of the multiple regression may change erratically in response to small changes
in the model or the data.

It is known in the literature (Golub et al., 1979; Berger, 1976) that there
exist estimates of $\beta$ with smaller mean square error than the OLS
estimate

$$\hat\beta(0) = (X^T X)^{-1} X^T Y. \tag{2}$$


Allowing for some bias may result in a significant variance reduction:
this is known as the bias-variance dilemma (see e.g. (Tibshirani, 1996;
Geman et al., 1992)), whose effects on output weights determination will be
examined in section 3.2.

Hereafter we focus on the one parameter family of ridge estimates
$\hat\beta(\lambda)$ given by

$$\hat\beta(\lambda) = (X^T X + n\lambda I)^{-1} X^T Y. \tag{3}$$

It can be shown that $\hat\beta(\lambda)$ is also the solution to the problem
of finding the minimum over $\beta$ of

$$\frac{1}{n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2, \tag{4}$$

which is known as the method of regularization in the approximation theory
literature (Golub et al., 1979); building on it, we will develop the
theoretical framework for our work in the next sections.
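
The relation between eqs. (3) and (4) can be checked numerically. The following is a minimal Python/NumPy sketch, not part of the original study: the function names `ridge_estimate` and `ridge_objective`, the synthetic data and the value of `lam` are all assumptions made for illustration. It computes the ridge estimate of eq. (3) and verifies that small random perturbations of it never lower the functional of eq. (4).

```python
import numpy as np

def ridge_estimate(X, Y, lam):
    """Ridge estimate of eq. (3): (X^T X + n*lam*I)^{-1} X^T Y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

def ridge_objective(X, Y, beta, lam):
    """Functional of eq. (4): (1/n)*||Y - X beta||^2 + lam*||beta||^2."""
    n = X.shape[0]
    return np.sum((Y - X @ beta) ** 2) / n + lam * np.sum(beta ** 2)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

lam = 0.1  # hypothetical value of the ridge parameter
beta_hat = ridge_estimate(X, Y, lam)
best = ridge_objective(X, Y, beta_hat, lam)

# Randomly perturbed candidates should never give a lower objective.
perturbed = [ridge_objective(X, Y, beta_hat + 0.01 * rng.normal(size=3), lam)
             for _ in range(100)]
is_minimum = all(val >= best for val in perturbed)
```

With `lam = 0` the same function reduces to the OLS estimate of eq. (2), provided $X^T X$ is invertible.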

There has always been a substantial amount of interest in estimating a
good value of $\lambda$ from the data: in addition to those already cited in
this section, a non-exhaustive list of well known or more recent papers is
e.g. (Hoerl and Kennard, 1976; Lawless and Wang, 1976; McDonald and Galarneau,
1975; Nordberg, 1982; Saleh and Kibria, 1993; Kibria, 2003; Khalaf and Shukur,
2005; Mardikyan and Cetin, 2008).

A meaningful review of these formulations is provided in (Dorugade and Kashid,
2010). They first define the matrix $T$ such that $T^T X^T X T = \Lambda$
($\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_p)$ contains the
eigenvalues of the matrix $X^T X$); then they set $Z = XT$ and
$\alpha = T^T\beta$, and show that a great number of different methods require
the OLS estimates of $\alpha$ and $\sigma^2$,

$$\hat\alpha = (Z^T Z)^{-1} Z^T Y, \tag{5}$$

$$\hat\sigma^2 = \frac{Y^T Y - \hat\alpha^T Z^T Y}{n - p - 1}, \tag{6}$$

to define effective ridge parameter values. It is important to note that often
specific conditions on data are needed to evaluate these estimators.

In particular this applies to the expressions of the ridge parameter
proposed by (Kibria, 2003) and (Hoerl and Kennard, 1970), which share the
characteristic of being functions of the ratio between $\hat\sigma^2$ and a
function of $\hat\alpha$; they will be used for comparison with our proposed
method in section 6.

The alternative technique of generalised cross-validation (GCV) proposed
by (Golub et al., 1979) provides a good estimate of $\lambda$ from the data as
the minimizer of

$$V(\lambda) = \frac{\frac{1}{n}\|(I - A(\lambda))Y\|_2^2}{\left[\frac{1}{n}\mathrm{Trace}(I - A(\lambda))\right]^2}, \tag{7}$$

where

$$A(\lambda) = X(X^T X + n\lambda I)^{-1} X^T. \tag{8}$$

This solution is particularly interesting, since it does not require an
estimate of $\sigma^2$; because of this, it will be one term of comparison with
our experimental results in section 6.
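
To make the GCV criterion concrete, here is a small Python/NumPy sketch under stated assumptions: the function name `gcv_score`, the synthetic data and the logarithmic search grid `lambdas` are hypothetical choices, not taken from the paper. It evaluates the GCV function of eq. (7), with $A(\lambda)$ from eq. (8), on a grid and picks the minimizer.

```python
import numpy as np

def gcv_score(X, Y, lam):
    """GCV function of eq. (7), with A(lam) defined as in eq. (8)."""
    n, p = X.shape
    A = X @ np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T)
    residual = (np.eye(n) - A) @ Y
    numerator = np.sum(residual ** 2) / n
    denominator = (np.trace(np.eye(n) - A) / n) ** 2
    return numerator / denominator

# Synthetic regression problem, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
Y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=40)

lambdas = np.logspace(-6, 1, 30)  # hypothetical search grid
scores = np.array([gcv_score(X, Y, lam) for lam in lambdas])
lam_gcv = lambdas[np.argmin(scores)]
```

Note that each grid point requires solving a linear system, which is the kind of repeated cost that an analytical choice of the parameter avoids.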

In the next section we will show how the problem of finding a good
solution to (1) applies to the context of pseudoinversion based neural
training, specializing the relevant matrices involved to deal with this issue.


3. Main ideas on regularization and condition number theory 

3.1. Generalised inverse matrix for weights setting 

We deal with a standard SLFN with $L$ input neurons, $M$ hidden neurons
and $Q$ output neurons, non-linear activation functions $\phi$ in the hidden
layer and linear activation functions in the output layer.

Considering a dataset of $N$ distinct training samples $(x_j, t_j)$, where
$x_j \in \mathbb{R}^L$ and $t_j \in \mathbb{R}^Q$, the learning process for a
SLFN aims at producing the matrix of desired outputs
$T \in \mathbb{R}^{N \times Q}$ when the matrix of all input instances
$X \in \mathbb{R}^{N \times L}$ is presented as input.

As stated in the introduction, in the pseudoinverse approach the matrix
of input weights and hidden layer biases is randomly chosen and no longer
modified: we name it $C$. After having fixed $C$, the hidden layer output
matrix $H = \phi(XC)$ is completely determined; we underline that since
$H \in \mathbb{R}^{N \times M}$ is in general rectangular, it is not
invertible.

The use of linear output neurons allows us to determine the output weight
matrix $W^*$ in terms of the OLS solution to the problem $T = HW + \epsilon$,
in analogy with eq. (1). Therefore, from eq. (2), we have

$$W^* = (H^T H)^{-1} H^T T. \tag{9}$$

According to (Penrose and Todd, 1956; Bishop, 2006),

$$W^* = H^+ T, \tag{10}$$

where $H^+$ is the Moore-Penrose pseudoinverse (or generalized inverse) of
matrix $H$, and it minimises the cost functional

$$E_D = \|HW - T\|_2^2. \tag{11}$$

Singular value decomposition (SVD) is a computationally simple and accurate
way to compute the pseudoinverse (see for instance (Golub and Van Loan,
1996)), as follows.


Every matrix $H \in \mathbb{R}^{N \times M}$ can be expressed as

$$H = U \Sigma V^T, \tag{12}$$

where $U \in \mathbb{R}^{N \times N}$ and $V \in \mathbb{R}^{M \times M}$ are
orthogonal matrices and $\Sigma \in \mathbb{R}^{N \times M}$ is a rectangular
diagonal matrix (i.e. a matrix with $\sigma_{ih} = 0$ if $i \neq h$);
its elements $\sigma_{ii} = \sigma_i$, called singular values, are
non-negative. A common convention is to list the singular values in
descending order, i.e.

$$\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_p \geq 0, \tag{13}$$

where $p = \min\{N, M\}$, so that $\Sigma$ is uniquely determined.

The SVD of $H$ is then used to obtain the pseudoinverse matrix

$$H^+ = V \Sigma^+ U^T, \tag{14}$$

where $\Sigma^+ \in \mathbb{R}^{M \times N}$ is again a rectangular diagonal
matrix whose elements $\sigma_i^+$ are obtained by taking the reciprocal of
each corresponding element: $\sigma_i^+ = 1/\sigma_i$ (see also (Rao and Mitra,
1971)). From eq. (9) we then have:

$$W^* = V \Sigma^+ U^T T. \tag{15}$$

Remark

An interesting case occurs when only $k < p$ elements in eq. (13) are
non-zero, i.e. $\sigma_{k+1} = \dots = \sigma_p = 0$; in this case the rank of
matrix $H$ is $k$ and $\Sigma^+$ is defined as:

$$\Sigma^+ = \mathrm{diag}(1/\sigma_1, \dots, 1/\sigma_k, 0, \dots, 0) \in \mathbb{R}^{M \times N}, \tag{16}$$

as shown for instance in (Golub and Van Loan, 1996).

This is also often done in practice, for computational reasons, for
elements smaller than a predefined threshold, thus actually computing an
approximated version of the pseudoinverse matrix $H^+$.

This approach is for example used by default for pseudoinverse evaluation
by means of the Matlab pinv function^1. Because the tool is widely used by
many scientists, for example in the ELM context, each time that it is applied
blindly, i.e. without having decided at what threshold to zero the small
$\sigma_i$, an a priori uncontrolled approximation is introduced in the
evaluation of $H^+$.
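
The effect of the threshold can be seen on a toy example. The sketch below (Python/NumPy; the function name `pinv_svd`, the toy matrix and both threshold values are assumptions chosen for illustration, and the thin SVD is used instead of the full rectangular $\Sigma^+$) builds the pseudoinverse from eqs. (14) and (16) with an explicit, user-chosen cutoff on the singular values.

```python
import numpy as np

def pinv_svd(H, threshold):
    """Pseudoinverse via eqs. (14) and (16): singular values not larger
    than `threshold` are set to zero instead of being inverted.
    Returns the pseudoinverse and the number of retained singular values."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    s_plus = np.zeros_like(s)
    keep = s > threshold
    s_plus[keep] = 1.0 / s[keep]
    return Vt.T @ (s_plus[:, None] * U.T), int(keep.sum())

# Toy matrix with two nearly identical columns (almost singular).
rng = np.random.default_rng(0)
H = rng.normal(size=(20, 4))
H[:, 3] = H[:, 2] + 1e-9 * rng.normal(size=20)

H_plus_full, rank_full = pinv_svd(H, threshold=0.0)   # invert every sigma_i
H_plus_cut, rank_cut = pinv_svd(H, threshold=1e-6)    # zero the tiny one

norm_full = np.linalg.norm(H_plus_full, 2)  # dominated by 1/sigma_min
norm_cut = np.linalg.norm(H_plus_cut, 2)    # moderate after truncation
```

The two results differ by many orders of magnitude in norm, which illustrates why the cutoff should be a deliberate choice rather than a library default.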


3.2. Stability and generalization properties of regularization algorithms 

A key property for any learning algorithm is stability: the learned
mapping has to suffer only small changes in presence of small perturbations
(for instance the deletion of one example in the training set).

Another important property is generalization: the performance on the
training examples (empirical error) must be a good indicator of the
performance on future examples (expected error), that is, the difference
between the two must be small. An algorithm that guarantees good
generalization predicts well if its empirical error is small.


^1 http://www.mathworks.com/help/matlab/ref/pinv.html

Many studies in the literature dealt with the connection between stability
and generalization: the notion of stability has been investigated by several
authors, e.g. by Devroye and Wagner (Devroye and Wagner, 1979) and Kearns
and Ron (Kearns and Ron, 1999).


Poggio et al. in (Mukherjee et al., 2003) introduced a statistical form
of leave-one-out stability, named $CVEEE_{loo}$, building on a
cross-validation leave-one-out stability endowed with conditions on stability
of both expected and empirical errors; they demonstrated that this condition
is necessary and sufficient for generalization and consistency of the class of
empirical risk minimization (ERM) learning algorithms, and that it is also a
sufficient condition for generalization for non-ERM algorithms (see also
(Poggio et al., 2004)).

To turn an original unstable, ill-posed problem into a well-posed one,
regularization methods of the form (4) are often used (Badeva and Morozov,
1991) and among them, Tikhonov regularization is one of the most common
(Tikhonov and Arsenin, 1977; Tikhonov, 1963). It minimises the error
functional

$$E = E_D + E_R = \|HW - T\|_2^2 + \|\Gamma W\|_2^2, \tag{17}$$


obtained by adding to the cost functional $E_D$ in eq. (11) a penalty term
$E_R$ that depends on a suitably chosen Tikhonov matrix $\Gamma$. This issue
has been discussed in its applications to neural networks in (Poggio and
Girosi, 1990b), and surveyed in (Girosi et al., 1995; Haykin, 1999).


Besides, Bousquet and Elisseeff (Bousquet and Elisseeff, 2002) proposed
the notion of uniform stability to characterize the generalization properties
of an algorithm. Their results state that Tikhonov regularization algorithms
are uniformly stable and that uniform stability implies good generalization
(Mukherjee et al., 2006).

Regularization thus introduces a penalty function that not only improves
stability, making the problem less sensitive to initial conditions, but also
contains model complexity, avoiding overfitting.

The idea of penalizing by a square function of weights is also well known
in the neural literature as weight decay: a large number of articles have been
devoted to this topic, and more generally to the advantage of regularization
for the control of overfitting. Among them we recall (Hastie et al., 2009;
Tibshirani, 1996; Bishop, 2006; Girosi et al., 1995; Fu, 1998; Gallinari and
Cibas, 1999).


A frequent choice is $\Gamma = \sqrt{\gamma}\, I$, to give preference to
solutions with smaller norm (Bishop, 2006), so eq. (17) can be rewritten as

$$E = E_D + E_R = \|HW - T\|_2^2 + \gamma\|W\|_2^2. \tag{18}$$


We define $\hat W = \arg\min_W (E)$ the regularized solution of (18): it
belongs to the family of ridge estimates described by eq. (3) and can be
expressed as

$$\hat W = (H^T H + \gamma I)^{-1} H^T T \tag{19}$$

or, as shown in (Fuhry and Reichel, 2012), as

$$\hat W = V D U^T T. \tag{20}$$

$V$ and $U$ are from the singular value decomposition of $H$ (eq. (12)) and
$D \in \mathbb{R}^{M \times N}$ is a rectangular diagonal matrix whose
elements, built using the singular values $\sigma_i$ of matrix $\Sigma$, are:

$$D_i = \frac{\sigma_i}{\sigma_i^2 + \gamma}. \tag{21}$$
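
The equivalence between the normal-equation form and the SVD form of the regularized solution can be verified numerically. The following Python/NumPy sketch is illustrative only: the function names, the toy matrices and the value of `gamma` are assumptions, and the thin SVD is used so that $D$ is stored as a vector of its diagonal entries.

```python
import numpy as np

def tikhonov_normal_eq(H, T, gamma):
    """Regularized solution in the form (H^T H + gamma*I)^{-1} H^T T."""
    M = H.shape[1]
    return np.linalg.solve(H.T @ H + gamma * np.eye(M), H.T @ T)

def tikhonov_svd(H, T, gamma):
    """Same solution as V D U^T T, with D_i = sigma_i / (sigma_i^2 + gamma)
    as in eqs. (20)-(21)."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    D = s / (s ** 2 + gamma)
    return Vt.T @ (D[:, None] * (U.T @ T))

# Toy hidden layer outputs and targets, for illustration only.
rng = np.random.default_rng(0)
H = np.tanh(rng.normal(size=(30, 6)))
T = rng.normal(size=(30, 2))
gamma = 0.05  # hypothetical regularization parameter

W_a = tikhonov_normal_eq(H, T, gamma)
W_b = tikhonov_svd(H, T, gamma)
max_diff = np.max(np.abs(W_a - W_b))
```

The SVD form has the practical advantage that, once the decomposition is available, changing `gamma` only requires recomputing the vector `D`.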


We remark on the difference between the minima of the regularized and
unregularized error functionals. Increasing values of the regularization
parameter $\gamma$ induce larger and larger departure of the former (eq. (19))
from the latter (eq. (10)). Thus, the regularization process increases the
bias of the approximating solution and reduces its variance, as discussed
about the bias-variance dilemma in section 2.

A suitable value for the Tikhonov parameter $\gamma$ has therefore to derive
from a compromise between having it sufficiently large to control the
approaching to zero of $\sigma_i$ in eq. (21), while avoiding an excess of the
penalty term in eq. (4). Its tuning is therefore crucial.


3.3. Condition number as a measure of ill-posedness 

The condition number of a matrix $A \in \mathbb{R}^{m \times n}$ is the number
$\mu(A)$ defined as

$$\mu(A) = \|A\| \, \|A^+\|, \tag{22}$$

where $\|\cdot\|$ is any matrix norm. If the columns (rows) of $A$ are linearly
independent, e.g. in case of experimental data matrices, then $A^+$ is a left
(right) inverse of $A$, i.e. $A^+ A = I_n$ ($A A^+ = I_m$). The Cauchy-Schwarz
inequality in this case then provides $\mu(A) \geq 1$; besides,
$\mu(A) = \mu(A^+)$.

Matrices are said to be ill-conditioned if $\mu(A) \gg 1$.

If the 2-norm is used, then

$$\mu(A) = \frac{\sigma_1(A)}{\sigma_p(A)}, \tag{23}$$

where $\sigma_1$ and $\sigma_p$ are the largest and smallest singular values of
$A$ respectively.

From eq. (23) we can easily understand that large condition numbers $\mu(A)$
suggest the presence of very small singular values (i.e. of almost singular
matrices), whose numerical inversion, required to evaluate $\Sigma^+$ and the
unregularized solution $W^*$, is a cause of instability.

From numerical linear algebra we also know that if the condition number
is large the problem of finding least-squares solutions to the corresponding
system of linear equations is ill-posed, i.e. even a small perturbation
in the data can lead to huge perturbations in the entries of the solution (see
(Golub and Van Loan, 1996)).
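
As a quick illustration of the 2-norm condition number as the ratio between the largest and smallest singular values, the Python/NumPy sketch below (the function name `condition_number` and the toy matrices are assumptions made for the example) compares a generic random matrix with one having two nearly collinear columns.

```python
import numpy as np

def condition_number(A):
    """2-norm condition number: largest over smallest singular value."""
    s = np.linalg.svd(A, compute_uv=False)  # returned in descending order
    return s[0] / s[-1]

rng = np.random.default_rng(0)
A_good = rng.normal(size=(20, 3))

# Make the last column almost a copy of the second one.
A_bad = A_good.copy()
A_bad[:, 2] = A_bad[:, 1] + 1e-8 * rng.normal(size=20)

mu_good = condition_number(A_good)  # modest value
mu_bad = condition_number(A_bad)    # very large: ill-conditioned
```

For the default 2-norm, `numpy.linalg.cond` returns the same quantity, so it can be used as a cross-check.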

According to (Mukherjee et al., 2006) the stability of Tikhonov
regularization algorithms can also be characterized using the classical notion
of condition number: our proposed regularization method fits within this
context. We will see that it specifically aims at analytically determining the
value of the $\gamma$ parameter that minimizes the condition number of the
regularized hidden layer output matrix, so that the solution $\hat W$ is stable
in the sense of eq. (2.9) of (Mukherjee et al., 2006).

In the next section, we will derive the optimal value of the regularization
parameter $\gamma$ according to this stability criterion (minimum condition
number).

The experimental results presented in sections 5 and 6 will show that
our quest for stable solutions allows us to also achieve good generalization
and predictivity. To this purpose, a comparison will be made with the
performance obtained when $\gamma$ is determined via the standard
cross-validation approach, aimed at overfitting control and generalization
performance optimization.


4. Conditioning of the regularized matricial framework

For convenient implementation of our diagnostics, and building on eq. (20),
we propose an original matricial framework in which to develop our study
tool with the following definition.

Definition 1. We define the matrix

$$H^{reg} \equiv V D U^T \tag{24}$$

as the regularized hidden layer output matrix of the neural network.

This allows us to rewrite eq. (20) as

$$\hat W = H^{reg} T, \tag{25}$$

for similarity with eq. (10).

By construction, $H^{reg}$ is decomposed in three matrices according to the
SVD framework, and its singular values are provided by eq. (21) as a function
of the singular values $\sigma_i$ of $H$.

This new regularized matricial framework makes it easier to compare the
properties of $H^{reg}$ with those of the corresponding unregularized matrix
$H^+$. In fact, when unregularized pseudoinversion is used, nothing prevents
the occurrence of very small singular values that make the evaluation of
$H^+$ numerically unstable (see eq. (14)). On the contrary, even in presence
of very small values $\sigma_i$ of the original unregularized problem, a
careful choice of the parameter $\gamma$ allows us to tune the singular values
$D_i$ of the regularized matrix $H^{reg}$, preventing numerical instability.

4.1. Condition number definition

According to eq. (23), we define the condition number of $H^{reg}$:

$$\mu(H^{reg}) = \frac{D_{max}}{D_{min}}, \tag{26}$$

where $D_{max}$ and $D_{min}$ are the largest and smallest singular values of
$H^{reg}$.

The shape of the functional relation $\sigma/(\sigma^2 + \gamma)$ that links
regularized and unregularized singular values, defined through eq. (21), is
shown in Fig. 1 for three different values of $\gamma$.

The curves are non-negative, because $\sigma > 0$ and $\gamma > 0$, and have
only one maximum, with coordinates $(\sqrt{\gamma},\ 1/(2\sqrt{\gamma}))$.

A few pairs of corresponding values $(D_i, \sigma_i)$ are marked by dots on
each curve.

For the sake of the determination of $\mu(H^{reg})$ we are interested in
evaluating $D_{max}$ and $D_{min}$ of $H^{reg}$ over the finite, discrete range
$[\sigma_1, \sigma_2, \dots, \sigma_p]$.

The value $D_{max}$ is reached in correspondence to a given singular value of
$H$, a priori not known, that we label $\sigma_{max}$, so that:

$$D_{max} = \frac{\sigma_{max}}{\sigma_{max}^2 + \gamma}. \tag{27}$$

Figure 1: Example of regularized/unregularized singular values relationship via eq. (21)


The variation of $\gamma$ has the effect of changing the curve and shifting
its maximum point within the interval $[\sigma_p, \sigma_1]$. Therefore,
$\sigma_{max}$ can coincide with any singular value of $H$ from eq. (13),
including the extreme ones.

Conversely, we now demonstrate that $D_{min}$ can only be reached in
correspondence to $\sigma_1$ or $\sigma_p$ (or both when coincident).

Theorem 3.1

The minimum singular value $D_{min}$ of matrix $H^{reg}$ can only be reached
in correspondence to the largest singular value $\sigma_1$ or to the smallest
singular value $\sigma_p$ of the unregularized matrix $H$ (or both).

Proof. Without loss of generality, we can express $\gamma$ as a function of
$\sigma_1\sigma_p$, i.e. $\gamma = \beta\sigma_1\sigma_p$, where $\beta$ is a
real positive value. By replacement in eq. (21), we get

$$D_1 = \frac{1}{\sigma_1 + \beta\sigma_p}, \qquad D_p = \frac{1}{\sigma_p + \beta\sigma_1}.$$

To establish their ordering, we evaluate the difference $\Delta$ of their
inverses:

$$\Delta = \frac{1}{D_1} - \frac{1}{D_p} = (\sigma_1 + \beta\sigma_p) - (\sigma_p + \beta\sigma_1) = (1 - \beta)(\sigma_1 - \sigma_p).$$

Recalling that $\sigma_1 - \sigma_p > 0$, we can distinguish three cases:

Case 1, $\beta > 1$ ($\gamma > \sigma_1\sigma_p$) $\Rightarrow \Delta < 0 \Rightarrow D_1 > D_p$.
Because of the shape of the $D_i$ distribution, $D_p$ is also the minimum
among all values $D_i$, so that $D_{min} = D_p$.

Case 2, $\beta < 1$ ($\gamma < \sigma_1\sigma_p$) $\Rightarrow \Delta > 0 \Rightarrow D_1 < D_p$.
Then, $D_1$ is also the minimum among all values $D_i$, so that
$D_{min} = D_1$.

Case 3, $\beta = 1$ ($\gamma = \sigma_1\sigma_p$) $\Rightarrow \Delta = 0 \Rightarrow D_1 = D_p$.
Thus, $D_1$ and $D_p$ are both minima, so that $D_{min} = D_1 = D_p$.

□
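
The three cases of the proof can be checked on a concrete set of singular values. The sketch below (Python/NumPy; the function name `regularized_singular_values` and the sample values in `sigma` are hypothetical, chosen only for illustration) evaluates the regularized singular values for $\beta < 1$, $\beta = 1$ and $\beta > 1$ and records them for inspection.

```python
import numpy as np

def regularized_singular_values(s, gamma):
    """Regularized values D_i = sigma_i / (sigma_i^2 + gamma)."""
    s = np.asarray(s, dtype=float)
    return s / (s ** 2 + gamma)

# Hypothetical singular values, listed in descending order.
sigma = np.array([9.0, 4.0, 2.0, 1.0, 0.5])

results = {}
for beta in (0.2, 1.0, 5.0):                 # beta < 1, beta = 1, beta > 1
    gamma = beta * sigma[0] * sigma[-1]      # gamma = beta * sigma_1 * sigma_p
    results[beta] = regularized_singular_values(sigma, gamma)

idx_small_beta = int(np.argmin(results[0.2]))  # minimum at sigma_1 (index 0)
idx_large_beta = int(np.argmin(results[5.0]))  # minimum at sigma_p (last index)
```

For $\beta = 1$ the first and last entries of `results[1.0]` coincide up to rounding, and no interior entry is smaller.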


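4.2. Condition number evaluation

The result by Theorem 3.1 allows us to find, according to eq. (26), the
following expressions for $\mu(H^{reg})$:

Case 1, $\beta > 1$:

$$\mu(H^{reg}) = \frac{D_{max}}{D_p} = \frac{\sigma_{max}(\sigma_p + \beta\sigma_1)}{\sigma_{max}^2 + \beta\sigma_1\sigma_p}$$

Case 2, $\beta < 1$:

$$\mu(H^{reg}) = \frac{D_{max}}{D_1} = \frac{\sigma_{max}(\sigma_1 + \beta\sigma_p)}{\sigma_{max}^2 + \beta\sigma_1\sigma_p}$$

Case 3, $\beta = 1$:

$$\mu(H^{reg}) = \frac{D_{max}}{D_p} = \frac{D_{max}}{D_1} = \frac{\sigma_{max}(\sigma_p + \sigma_1)}{\sigma_{max}^2 + \sigma_1\sigma_p}$$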


Bearing in mind that well-conditioned problems are characterized by
small condition numbers, we will now look for the $\beta$ parameter values
which, in the three cases above, make the regularized condition number smaller.

In Case 1, $\mu(H^{reg})$ is an increasing function of $\beta$, so that in its
domain, i.e. $(1, \infty)$, its minimum value is reached when
$\beta \to 1^+$. On the contrary, in Case 2, $\mu(H^{reg})$ is a decreasing
function of $\beta$, so that in its domain, i.e. $(0, 1)$, the minimum is
reached when $\beta \to 1^-$.

Fig. 2 shows the function behaviour over the whole domain.

Figure 2: Regularized condition number vs. $\beta$

Both cases have a common limit:

$$\lim_{\beta \to 1^+} \mu(H^{reg}) = \lim_{\beta \to 1^-} \mu(H^{reg}) = \frac{\sigma_{max}(\sigma_p + \sigma_1)}{\sigma_{max}^2 + \sigma_1\sigma_p}$$

This value is exactly the one provided by Case 3, which can therefore be
considered the best possible choice to minimize the condition number.

Thus our quest for the best possible conditioning for the matrix $H^{reg}$
identifies an explicit optimal value for the regularization parameter
$\gamma$:

$$\gamma = \sigma_1\sigma_p \tag{29}$$
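
The resulting training procedure is short enough to sketch end to end. The Python/NumPy code below is an illustration under stated assumptions, not the authors' Matlab implementation: the function name `ocrep_output_weights`, the toy data, the uniform range of the random input weights and the omission of hidden layer biases are all choices made for the example.

```python
import numpy as np

def ocrep_output_weights(H, T):
    """Output weights with gamma = sigma_1 * sigma_p, computed through the
    SVD form W = V D U^T T with D_i = sigma_i / (sigma_i^2 + gamma)."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    gamma = s[0] * s[-1]                 # product of extreme singular values
    D = s / (s ** 2 + gamma)
    W = Vt.T @ (D[:, None] * (U.T @ T))
    mu_reg = D.max() / D.min()           # regularized condition number
    return W, gamma, mu_reg

# Toy SLFN: random input weights C, sigmoidal hidden units (biases omitted).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
T = np.sin(X[:, :1])
C = rng.uniform(-1.0, 1.0, size=(4, 30))
H = 1.0 / (1.0 + np.exp(-X @ C))

W, gamma, mu_reg = ocrep_output_weights(H, T)

# Unregularized condition number of H, for comparison.
s = np.linalg.svd(H, compute_uv=False)
mu_unreg = s[0] / s[-1]
```

Only one SVD of $H$ is needed: the parameter is read off the extreme singular values, with no search over candidate values.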


5. Simulation and Discussion 


For the numerical experimentation, we use eight benchmark datasets from
the UCI repository (Bache and Lichman, 2013), listed in Table 1. All
simulations are carried out in the Matlab 7.3 environment.

The performance is assessed by statistics over a set of 50 different
extractions of input weights, computing either the average RMSE (for
regression tasks) or the average percentage of misclassification rate (for
classification tasks) on the test set. Either quantity is labeled “Err” in
the tables summarising our results. The error standard deviation (labeled
“Std”) is also computed to show the dispersion of experimental results.

Dataset          Type             N. Instances   N. Attributes   N. Classes
Abalone          Regression       4177           8               -
Machine Cpu      Regression       209            6               -
Delta Ailerons   Regression       7129           5               -
Housing          Regression       506            13              -
Iris             Classification   150            4               3
Diabetes         Classification   768            8               2
Wine             Classification   178            13              3
Segment          Classification   2310           19              7

Table 1: The UCI datasets and their characteristics 


Our regularization strategy, labeled Optimally Conditioned Regularization
for Pseudoinversion (OCReP), is verified by simulation against the common
approach in which cross-validation is used i) to determine the regularization
parameter $\gamma$ at a fixed high number of hidden neurons and ii) to
also perform optimization of the number of hidden neurons, respectively in
sec. 5.1 and 5.2.

A discussion of the effectiveness of OCReP in terms of minimization of
the condition number of the matrices involved is given in sec. 5.3.

5.1. OCReP performance assessment: fixed number of hidden units 

In this section we compare OCReP with a regularization approach in
which $\gamma$ is selected by a cross-validation scheme, which is typically
used for control of under/overfitting and optimization of the model
generalization performance. A 70%/30% split between training and test set is
applied; then, a three-fold cross-validation search on the training set
identifies the best $\gamma$ by best performance on the validation set, over
the set of 50 values of $\gamma$
[ 10 - 25 , 

For the sake of comparison, a fixed, high number of hidden units $M$ is
used, selected according to dimension and complexity of the datasets. For
the three datasets Machine Cpu, Iris and Wine the simulation is performed
for 50 and 100 hidden neurons; for Abalone, Delta Ailerons, Housing and
Diabetes, we use 50, 100, 200 and 300 neurons; for Segment, we use 1000
and 1500 units.

Figures 3 and 4 (respectively for regression and classification datasets)
show average test errors as a function of the sampled values of $\gamma$ (red
dots); the standard deviation is shown as an error bar. Our proposed optimal
$\gamma$ is shown as a blue circle, whereas the value of $\gamma$ selected by
cross-validation is shown as a black square. The results are in each case
related to the highest number of neurons experimented.

Figure 3: Test error trends for regression datasets as a function of the values of $\gamma$ over
the selected cross-validation range (red dots): the cross-validation selected $\gamma$ is the black
square; the proposed $\gamma$ from OCReP is the blue circle.

The horizontal axis has been zoomed in onto the region of interest, i.e. 

[10-^°, 107 

It may be noted that the performance from OCReP and cross-validation 
are comparable, and also close to the experimental minimum. This may be 
interpreted as good predictivity for both algorithms. 

Also, we remark that the error bars, i.e. the dispersion of the experimental
results, are large for small values of $\gamma$, consistently with
expectations on ineffective regularization.

[Figure 4 panels: Iris (M=100), Wine (M=100)]

Figure 4: Test error trends for classification datasets as a function of the values of $\gamma$ over
the selected cross-validation range (red dots): the cross-validation selected $\gamma$ is the black
square; the proposed $\gamma$ from OCReP is the blue circle.


Table 2: Comparison of OCReP vs. cross-validation at fixed number of hidden neurons
for small size datasets

M     Method       Stat   Iris    Wine    Machine Cpu
50    OCReP        Err.   1.51    2.98    31.21
                   Std    1.13    1.75    1.1
      cross-val.   Err.   2.13    3.37    31.1
                   Std    0.77    2.27    1.02
100   OCReP        Err.   2.53    1.39    34.13
                   Std    0.77    1.19    01.68
      cross-val.   Err.   2.17    1.88    30.94
                   Std    0.31    1.88    0.69


Table 3: Comparison of OCReP vs. cross-validation at fixed number of hidden neurons
for large size datasets

M      Method       Stat   Segment
1000   OCReP        Err.   2.53
                    Std    0.77
       cross-val.   Err.   2.17
                    Std    0.31
1500   OCReP        Err.   4.41
                    Std    0.45
       cross-val.   Err.   3.97
                    Std    0.35


The numerical results are reported in Tab. 2, 3 and 4 according to
the grouping based on dimension and complexity of the datasets.

For each dataset and selected number of hidden neurons $M$, the best test
error is shown in bold, whenever the difference is statistically significant^2.


^The Student’s t-test has been used for assessing the statistical significance through 
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Table 4: Comparison of OCReP vs. cross-validation at fixed number of hidden neurons 
for medium size datasets. For Delta Ailerons, average errors and standard deviations have 
to be multiplied by lO”"*. 





Abalone 

Delta Ailerons 

Housing 

Diabetes 



OCReP 

Err. 

2.22 

1.64 

5.54 

26.01 


50 

Std 

0.16 

0.0051 

0.12 

0.604 


cross-val. 

Err. 

2.13 

1.59 

4.79 

26.79 



Std 

0.017 

0.0073 

0.37 

0.814 



OCReP 

Err. 

2.15 

1.62 

5.17 

25.66 


100 

Std 

0.007 

0.004 

0.08 

0.608 


cross-val. 

Err. 

2.11 

1.58 

4.49 

25.71 

M 


Std 

0.006 

0.0036 

0.28 

0.608 



OCReP 

Err. 

2.12 

1.59 

4.62 

25.13 


200 

Std 

0.003 

0.0031 

0.09 

0.445 


cross-val. 

Err. 

2.11 

1.61 

4.30 

25.79 



Std 

0.003 

0.0096 

0.27 

0.443 



OCReP 

Err. 

2.113 

1.58 

4.24 

24.26 


300 

Std 

0.03 

0.0018 

0.13 

0.689 


cross-val. 

Err. 

2.114 

1.60 

4.18 

25.66 



Std 

0.003 

0.0042 

0.23 

0.456 


Thus, for example, on Iris the best performance is achieved using 50 neu¬ 
rons by OCReP, and with 100 neurons by cross-validation. In some cases, e.g. 
Wine (50 neurons), there is no clear winner from statistical considerations, 
i.e. the best results are comparable, within the errors. 

From the above results it appears that cross-validation has better test 
error performance on a number of datasets slightly higher, at hxed number 
of hidden nenrons. However, it is important to evidence that the use of 
OCReP allows to save the hundreds of psendoinversion steps reqnired by 
cross-validation, wich is a crucial issue for practical implementation. 


determination of the confidence intervals related to 99% confidence level. 
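
The significance check described in the footnote can be sketched from the tabulated means and standard deviations alone. The snippet below is only an illustration under stated assumptions: it uses a normal approximation to Student's t, assumes 50 runs per method, and the function name `significantly_different` is hypothetical rather than taken from the paper.

```python
import math
from statistics import NormalDist


def significantly_different(mean_a, std_a, mean_b, std_b, n_runs=50, level=0.99):
    """Two-sample test on summary statistics (normal approximation to Student's t)."""
    # Standard error of the difference between the two sample means.
    se = math.sqrt(std_a ** 2 / n_runs + std_b ** 2 / n_runs)
    statistic = abs(mean_a - mean_b) / se
    # Two-sided critical value at the requested confidence level.
    critical = NormalDist().inv_cdf(0.5 + level / 2.0)
    return statistic > critical


# Iris, M = 50 (Table 2): OCReP 1.51 +/- 1.13 vs. cross-validation 2.13 +/- 0.77
iris_50 = significantly_different(1.51, 1.13, 2.13, 0.77)
# Wine, M = 50 (Table 2): OCReP 2.98 +/- 1.75 vs. cross-validation 3.37 +/- 2.27
wine_50 = significantly_different(2.98, 1.75, 3.37, 2.27)
print(iris_50, wine_50)  # prints: True False
```

Under these assumptions the Iris difference at M = 50 is significant while the Wine difference is not, in line with the discussion of Table 2.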
5.2. OCReP performance assessment: variable number of hidden units

In order to pursue the double aim of performance and hidden units
optimization, a first interesting step is to look at how the test error of
unregularized models (i.e. models whose output weights are evaluated by
plain pseudoinversion, with no regularization term) varies as a function of
the hidden layer dimension.

A setting widely used among researchers adopting such techniques (see e.g.
Helmy and Rasheed (2009); Huang et al. (2006)) draws the input weights from
a uniform random distribution in the interval (-1, 1) and uses sigmoidal
activation functions for the hidden neurons: hereafter we name this
framework Sigm-unreg.
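
A minimal NumPy sketch of this setting is given here for orientation. It assumes a standard Tikhonov (ridge) form for the regularized variant and includes a hidden bias term; the data, the value `gamma=1e-2` and the function names are illustrative assumptions, not the paper's code or its analytically derived parameter.

```python
import numpy as np


def hidden_output(X, W, b):
    """Hidden layer output matrix H for sigmoidal hidden units."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))


def train_output_weights(H, T, gamma=0.0):
    """Output weights by pseudoinversion (gamma = 0) or ridge-regularized least squares."""
    if gamma == 0.0:
        return np.linalg.pinv(H) @ T
    M = H.shape[1]
    # Assumed regularized form: (H^T H + gamma I)^{-1} H^T T
    return np.linalg.solve(H.T @ H + gamma * np.eye(M), H.T @ T)


rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))                       # toy inputs (hypothetical data)
T = np.sin(X[:, :1]) + 0.1 * rng.normal(size=(120, 1))

M = 30                                              # number of hidden neurons
W = rng.uniform(-1.0, 1.0, size=(4, M))             # input weights uniform in (-1, 1)
b = rng.uniform(-1.0, 1.0, size=(1, M))             # hidden biases (an assumption)

H = hidden_output(X, W, b)
beta_unreg = train_output_weights(H, T)             # Sigm-unreg
beta_reg = train_output_weights(H, T, gamma=1e-2)   # regularized, illustrative gamma
```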


[Figure 5 panel: Abalone]

Figure 5: Test error trends for regression datasets: OCReP vs. unregularized
pseudoinversion.

Figures 5 and 6 show, for regression and classification datasets respectively,
the average test error (over 50 different selections of input weights) for
both OCReP (blue line) and Sigm-unreg (red line) as a function of the number
of hidden nodes, which is gradually increased in unit steps. In all cases,
after an initial decrease the Sigm-unreg test error increases significantly.

On the contrary, the OCReP test error curves keep decreasing, albeit at a
progressively slower rate, which also shows the method's good capability to
control overfitting.

We now compare the results obtained when the trade-off value of γ is
searched by cross-validation with the two frameworks discussed so far,
i.e. OCReP and Sigm-unreg.

A 70%/30% split between training and test set is applied; we then perform
a three-fold cross-validation to select, in each case, the number of hidden
neurons M at which the minimum error is recorded. Test errors are again
evaluated as the average over 50 different random choices of input weights.
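
The protocol above (70%/30% split, three-fold cross-validation over M, averaging over 50 random input weight choices) can be sketched as follows. This is a hedged illustration on synthetic data: the candidate values of M, the fixed `gamma` and the helper names are assumptions, and no bias term is used for brevity.

```python
import numpy as np


def fit_predict(X_tr, T_tr, X_te, M, gamma, rng):
    """Train a sigmoidal SLFN with random input weights; return test predictions."""
    W = rng.uniform(-1.0, 1.0, size=(X_tr.shape[1], M))
    H_tr = 1.0 / (1.0 + np.exp(-(X_tr @ W)))
    H_te = 1.0 / (1.0 + np.exp(-(X_te @ W)))
    beta = np.linalg.solve(H_tr.T @ H_tr + gamma * np.eye(M), H_tr.T @ T_tr)
    return H_te @ beta


def rmse(Y, T):
    return float(np.sqrt(np.mean((Y - T) ** 2)))


def select_M(X, T, candidates, gamma, rng, n_folds=3):
    """Pick the number of hidden neurons with the lowest mean validation error."""
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    cv_err = {}
    for M in candidates:
        errs = []
        for k in range(n_folds):
            val = folds[k]
            tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            errs.append(rmse(fit_predict(X[tr], T[tr], X[val], M, gamma, rng), T[val]))
        cv_err[M] = float(np.mean(errs))
    return min(cv_err, key=cv_err.get), cv_err


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                       # synthetic data (hypothetical)
T = np.tanh(X @ np.array([[1.0], [-2.0], [0.5]]))

perm = rng.permutation(len(X))
n_train = int(0.7 * len(X))                         # 70%/30% split
train, test = perm[:n_train], perm[n_train:]

best_M, cv_err = select_M(X[train], T[train], [5, 10, 20], gamma=1e-3, rng=rng)

# Test error averaged over 50 random choices of input weights.
test_errs = [rmse(fit_predict(X[train], T[train], X[test], best_M, 1e-3, rng), T[test])
             for _ in range(50)]
mean_err, std_err = float(np.mean(test_errs)), float(np.std(test_errs))
```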

The numerical results of the simulation are presented in Tables 5 and 6,
for regression and classification tasks respectively, together with their
standard deviations (Std) and the selected M.

Best test errors are shown in bold whenever the difference between
OCReP and cross-validation is statistically significant.

We see that our proposed regularization technique provides, for regression
datasets, performance comparable with the cross-validation option but
always a better performance (with statistical significance at 99% level) with
respect to the unregularized case.

For classification datasets, in three cases out of four OCReP provides
a better performance than cross-validation, and always a better
performance than the Sigm-unreg case. In all such cases, the
statistical significance is at the 99% level.

Also, in almost all cases smaller standard deviations are associated with
the OCReP method, suggesting a lower sensitivity to the initial choice of
input weights.

5.3. Additional considerations 

The proposed method OCReP presents, in our opinion, two features of
interest: its computational efficiency on the one hand, and its optimal
conditioning on the other.

Our goal of optimal analytic determination of the regularization parameter
γ results in a dramatic reduction of the computing requirements with
respect to experimental tuning by search over a large pre-defined grid
of tentative values. In the latter case, for each choice of γ over the
selected range, at least one pseudoinversion is required for every output
weight determination, thus increasing the computational load by a factor
N_γ, the number of tentative values.

[Figure 6 panels: Iris, Wine, Diabetes, Segment]

Figure 6: Test error trends for classification datasets: OCReP vs. unregularized
pseudoinversion.

Besides, our method is designed explicitly for optimal conditioning. In our
simulations, we verify that this goal is fulfilled by evaluating the average
condition numbers of the hidden layer output matrices. The statistics are
computed over 50 different configurations of input weights and a fixed number
of hidden units, namely the largest used in section 5.1 for each dataset. The
results are summarised in Tables 7 and 8, for regression and classification
datasets respectively. On the first row of each table, we list the ratio of
the average condition numbers of the matrices associated with OCReP and with
Sigm-unreg, i.e. the regularized and unregularized approaches. On the second
row, we list the ratio of the average condition numbers of the matrices
associated with OCReP and with cross-validation, thus comparing our
regularization approach with the more conventional one.

Not surprisingly, our regularization method provides a significant
improvement on conditioning with respect to the unregularized approach, as
evidenced by ratio values much smaller than unity. Besides, OCReP also
provides better conditioned matrices than those derived by selection of γ
through cross-validation, since the corresponding condition numbers are
systematically smaller in the former case, sometimes by up to an order of
magnitude.
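
How regularization lowers the condition number can be illustrated with a short sketch. It assumes, purely for illustration, that the compared matrices are H^T H + γI and H^T H for a full column rank H; the matrices actually used for Tables 7 and 8 are those defined earlier in the paper, and `cond_ratio` is a hypothetical helper name.

```python
import numpy as np


def cond_ratio(H, gamma):
    """Ratio cond(H^T H + gamma I) / cond(H^T H), computed from singular values of H."""
    s = np.linalg.svd(H, compute_uv=False)           # sorted in descending order
    cond_unreg = (s[0] / s[-1]) ** 2                 # cond(H^T H)
    cond_reg = (s[0] ** 2 + gamma) / (s[-1] ** 2 + gamma)
    return cond_reg / cond_unreg


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                        # toy inputs (hypothetical data)
W = rng.uniform(-1.0, 1.0, size=(4, 8))
H = 1.0 / (1.0 + np.exp(-(X @ W)))                   # sigmoidal hidden layer output

ratio = cond_ratio(H, gamma=1e-3)                    # below 1 whenever gamma > 0
```

Because adding γ to both the largest and the smallest eigenvalue shrinks their quotient, the ratio is always below unity for γ > 0 in this simplified setting.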


Table 5: Hidden layer optimization for regression tasks. For Delta Ailerons, average errors
and standard deviations have to be multiplied by 10^-4.

Method             Stat.   Abalone   Housing   Delta Ailerons   Machine Cpu
OCReP              Err.    2.12      4.25      1.58             31.22
                   Std.    0.32      0.13      0.0048           0.78
                   M       178       255       298              63
Cross-validation   Err.    2.11      4.19      1.58             31.51
                   Std.    0.0097    0.25      0.0036           1.25
                   M       110       250       93               70
Sigm-unreg         Err.    2.14      4.73      1.62             34.44
                   Std.    0.014     0.20      0.57             2.89
                   M       31        76        74               15


6. Comparison with other approaches 

Since the literature provides a host of different recipes for either the choice
of the regularization parameter or the actual regularization algorithm,
hereafter we focus on a couple of specific frameworks.


6.1. Other choices of regularization parameter 

Table 6: Hidden layer optimization for classification tasks.

Method             Stat.   Iris    Wine    Diabetes   Segment
OCReP              Err.    1.6     1.73    25.53      2.50
                   Std.    1.10    1.25    0.51       0.32
                   M       67      91      291        760
Cross-validation   Err.    2.12    2.10    25.2       2.65
                   Std.    1.26    2.27    1.29       0.38
                   M       14      137     25         620
Sigm-unreg         Err.    2.31    3.20    25.92      4.45
                   Std.    1.48    2.09    1.12       0.47
                   M       67      91      291        760

Table 7: Condition number comparison for regression datasets

Ratio                 Abalone   Housing   Delta Ailerons   Machine Cpu
OCReP / Sigm-unreg    0.0002    0.0008    0.00007          0.0001
OCReP / cross-val.    g g Q 3 g 3

Among the approaches mentioned in section 2, we primarily select the
technique of generalised cross-validation (GCV) from Golub et al. (1979),
described by eqs. (17) and (18), for comparison with our method. The main
motivation for our choice is its independence of the estimate of the error
variance σ², a characteristic shared with our approach. For each dataset, we
select the same fixed numbers of hidden units as in section 5.1; then, for
each case, eq. (17) is minimized over a set of 50 values of γ, ranging from
a negative to a positive power of ten, and for 50 different configurations
of input weights.
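
A GCV search of this kind can be sketched as below, using the standard GCV score for ridge regression in the sense of Golub et al. (1979) with a single-column target. The grid bounds in `np.logspace(-10, 10, 50)` are placeholders chosen for illustration (the paper's exponents are not reproduced here), and the function names are hypothetical.

```python
import numpy as np


def gcv_score(H, t, gamma):
    """GCV score: (1/n)||(I - A)t||^2 / [(1/n) tr(I - A)]^2, with A the ridge hat matrix."""
    n = H.shape[0]
    U, s, _ = np.linalg.svd(H, full_matrices=False)
    f = s ** 2 / (s ** 2 + gamma)                    # ridge filter factors
    resid = t - U @ (f * (U.T @ t))                  # (I - A) t
    return (resid @ resid / n) / ((n - f.sum()) / n) ** 2


def gcv_select(H, t, gammas):
    """Return the grid value of gamma that minimizes the GCV score."""
    scores = [gcv_score(H, t, g) for g in gammas]
    return gammas[int(np.argmin(scores))]


rng = np.random.default_rng(0)
H = rng.normal(size=(40, 5))                         # stand-in for a hidden layer output
t = H @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=40)

gammas = np.logspace(-10, 10, 50)                    # hypothetical grid of 50 values
gamma_gcv = gcv_select(H, t, gammas)
```

Each grid value needs its own evaluation of the score, which is the cost that an analytic choice of γ avoids.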

We evaluate the mean and standard deviation of the corresponding
regularized test error, reported in Tables 9, 10 and 11. We also recall that the
tabulated error "Err" is either the average RMSE for regression tasks, or the
average misclassification rate for classification tasks; "Std" is the
corresponding standard deviation. The performance comparison is based on
statistical significance at 99% level.

Whenever GCV provides test error values statistically better than OCReP
(listed in Tabs. 2, 3 and 4), they are marked in bold.

Table 8: Condition number comparison for classification datasets

Ratio                 Iris      Wine    Diabetes   Segment
OCReP / Sigm-unreg    0.00002   0.005   0.0007     0.000005
OCReP / cross-val.    0.2       0.4     0.1        0.2

Table 9: GCV results at fixed number of hidden neurons for small datasets

M     Stat.   Iris    Wine    Machine Cpu
50    Err.    2.47    3.66    33.03
      Std     1.06    2.42    1.27
100   Err.    3.06    3.77    36.06
      Std     1.08    2.44    1.13

We remark that in all cases listed in Tabs. 9 and 10 OCReP provides
statistically better results than GCV. For medium size datasets the behaviour
is somewhat mixed: with 50 hidden neurons, GCV wins; with 100 neurons, for
three out of four datasets (i.e. Abalone, Housing and Diabetes) the
performance is statistically comparable. In all other cases of Tab. 11 OCReP
again provides statistically better results than GCV.
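
For reference, the quantities "Err" and "Std" used throughout these comparisons can be computed as in the sketch below. The helper names are hypothetical, the misclassification rate is expressed in percent as an assumption, and the sample standard deviation is used since the source does not say which estimator was adopted.

```python
import math
from statistics import mean, stdev


def rmse(predictions, targets):
    """Root mean squared error, the error measure for regression tasks."""
    return math.sqrt(mean([(p - t) ** 2 for p, t in zip(predictions, targets)]))


def misclassification_rate(predicted_labels, true_labels):
    """Percentage of wrongly classified samples, the error measure for classification."""
    wrong = sum(p != t for p, t in zip(predicted_labels, true_labels))
    return 100.0 * wrong / len(true_labels)


def summarize(errors):
    """Err and Std over repeated runs with different random input weights."""
    return mean(errors), stdev(errors)


err, std = summarize([1.0, 2.0, 3.0])   # toy per-run errors, not results from the paper
```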

We make two other comparisons, using the ridge estimates described in
eq. (13) and eq. (9) of Dorugade and Kashid (2010), and proposed
respectively by Kibria (2003) and Hoerl and Kennard (1970):


Table 10: GCV results at fixed number of hidden neurons for large size datasets

M      Stat.   Segment
1000   Err.    11.39
       Std     0.75
1500   Err.    14.72
       Std     0.803
Table 11: GCV results at fixed number of hidden neurons for medium size datasets. For
Delta Ailerons, average errors and standard deviations have to be multiplied by 10^-4.

M     Stat.   Abalone   Housing   Delta Ailerons   Diabetes
50    Err.    2.13      4.89      1.60             25.2
      Std     0.017     0.45      0.0103           1.22
100   Err.    2.15      5.05      1.63             26.66
      Std     0.021     0.70      0.0297           1.39
200   Err.    2.32      6.78      1.74             27.73
      Std     0.10      2.35      0.0892           1.27
300   Err.    2.98      8.07      2.20             27.14
      Std     0.42      2.89      0.4054           1.15


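\gamma_{K} = \frac{1}{p} \sum_{i=1}^{p} \frac{\hat{\sigma}^2}{\hat{\alpha}_i^2} ,    (30)

\gamma_{HK} = \frac{\hat{\sigma}^2}{\hat{\alpha}_{\max}^2}    (31)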
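
The two ridge estimates can be computed from a single SVD, as in the following sketch. It assumes the usual canonical form of the regression problem (α̂ are the least squares coefficients in the basis of right singular vectors, σ̂² is the residual variance with n - p degrees of freedom) and the arithmetic-mean form of Kibria's estimator; `ridge_estimates` is a hypothetical helper name and the data are synthetic.

```python
import numpy as np


def ridge_estimates(X, y):
    """Hoerl-Kennard and Kibria (arithmetic mean) ridge parameters for y = X beta + e."""
    n, p = X.shape
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    alpha = (U.T @ y) / s                            # OLS coefficients in canonical form
    resid = y - U @ (U.T @ y)                        # least squares residual
    sigma2 = resid @ resid / (n - p)                 # estimate of the error variance
    k_hk = sigma2 / np.max(alpha ** 2)               # Hoerl and Kennard (1970)
    k_kibria = np.mean(sigma2 / alpha ** 2)          # Kibria (2003), arithmetic mean
    return k_hk, k_kibria


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))                         # synthetic design matrix
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.2 * rng.normal(size=60)

k_hk, k_kibria = ridge_estimates(X, y)
```

By construction the Hoerl and Kennard value never exceeds the Kibria value, since the latter averages σ̂²/α̂_i² over all components while the former keeps only the smallest such term.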


Our experimentation is made only for regression datasets because the
theoretical background of Dorugade and Kashid (2010), and of most of the other
works referred to in section 2, directly applies to the case in which the
quantity Y in eq. (1) is a one-column matrix. In our formulation Y is the
desired target T, and it is a one-column matrix only for regression tasks.

For each dataset we apply both methods described by eqs. (30) and (31): we
select the same fixed numbers of hidden units as in section 5.1 and perform
50 experiments with different configurations of input weights.

Each step of pseudoinversion is regularized, for each method, with the
corresponding γ value. We evaluate the mean and standard deviation of the
regularized test errors, reported in Tables 12 and 13 respectively.

Whenever the methods provide test error values statistically better than
OCReP (listed in Tabs. 2 and 4), they are marked in bold.

We remark that the method by Kibria obtains a better performance in
two cases out of sixteen, while OCReP does in twelve cases out of sixteen.
Besides, the method by Hoerl and Kennard obtains a better performance in
three cases out of sixteen, while OCReP does in eight cases out of sixteen.
For both methods, better performance is achieved only for the case of
M = 50 neurons.

Table 12: Kibria estimate of ridge parameter: results at fixed number of hidden neurons
for regression datasets. For Delta Ailerons, average errors and standard deviations have
to be multiplied by 10^-4.

M     Stat.   Abalone   Housing   Delta Ailerons   Machine Cpu
50    Err.    2.32      5.72      1.63             34.28
      Std     0.37      0.84      0.027            4.67
100   Err.    2.38      5.45      1.64             32.40
      Std     0.90      0.86      0.08             3.72
200   Err.    2.20      5.31      1.65             -
      Std     0.13      0.76      0.15             -
300   Err.    2.34      5.46      1.62             -
      Std     1.01      1.60      0.035            -

It may be noted that, with respect to processing requirements, OCReP has
clear advantages, since it requires only one SVD step for each determination
of γ, while the above two methods require a full spectral decomposition and
an additional matrix inversion.


6.2. Alternative regularization methods 


A first comparison can be made with the work by Huang et al. (2012), whose
Extreme Learning Machine (ELM) technique uses a cost parameter C that can be
considered as related to the inverse of our regularization parameter γ. As
the authors state, in order to achieve good generalization performance, C
needs to be chosen appropriately. They do this by trying 50 different values
of this parameter: [2^-24, 2^-23, ..., 2^24, 2^25].

A fair comparison can be made on our classification datasets, using their
number of hidden neurons, i.e. 1000. Our optimal choice of γ allows us to
obtain a better performance on all datasets (with statistical significance
assessed at the same confidence level as in previous experiments).
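
The size of the ELM search can be made concrete with a few lines of code. The sketch assumes that γ is taken as exactly 1/C and counts one regularized solve per grid value and per input weight configuration; `count_solves` is a hypothetical helper and the figure of 50 configurations mirrors the protocol used in our experiments.

```python
# Grid of ELM cost parameters as reported above: 50 powers of two.
C_grid = [2.0 ** k for k in range(-24, 26)]

# Assumption: the regularization parameter is taken as exactly 1 / C.
gamma_grid = [1.0 / C for C in C_grid]


def count_solves(n_grid_values, n_weight_configs):
    """Regularized solves needed when every grid value is tried per configuration."""
    return n_grid_values * n_weight_configs


grid_cost = count_solves(len(C_grid), 50)    # exhaustive search over the grid
analytic_cost = count_solves(1, 50)          # a single analytically chosen value
```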

Deng et al. (2009) propose a Regularized Extreme Learning Machine
(hereafter, RELM) in which the regularization parameter is selected
according to a similar criterion among 100 values, given as powers of two.
Because their performance is optimized with respect to the number of hidden
neurons, for the sake of comparison we use OCReP values from Table 6.
We obtain a statistically significant better performance on dataset Segment,
while for Diabetes the method RELM performs better (see Table 15).

Comparing our results on the common regression datasets with the
alternative method TROP-ELM proposed by Miche et al. (2011), we note that
OCReP always achieves lower RMSE values² (with statistical significance),
as can be seen from Table 16.

Besides, in our opinion our method is simpler, in the sense that it uses a
single step of regularization rather than two.

In Martínez-Martínez et al. (2011), an algorithm is proposed for pruning
ELM networks by using regularized regression methods: the crucial step
of regularization parameter determination is solved by creating K different
models, each one based on a different value of this parameter, among which
the best one is selected using a Bayesian information criterion. The authors
state that a typical value for K is 100, thus a heavy computational load is
required, and the method is focused on regression tasks.

Table 13: H-K estimate of ridge parameter: results at fixed number of hidden neurons for
regression datasets. For Delta Ailerons, average errors and standard deviations have to be
multiplied by 10^-4.

M     Stat.   Abalone   Housing   Delta Ailerons   Machine Cpu
50    Err.    2.13      4.87      1.60             34.28
      Std     0.016     0.44      0.01             2.37
100   Err.    2.14      4.98      1.62             37.39
      Std     0.90      0.67      0.029            3.18
200   Err.    2.33      8.101     1.73             -
      Std     0.10      2.83      0.08             -
300   Err.    2.95      29.06     2.21             -
      Std     0.41      9.26      0.41             -

Table 14: Comparison between OCReP and ELM

Method   Stat.   Iris    Wine    Diabetes   Segment
OCReP    Err.    2.22    1.28    21.06      3.40
         Std.    0.21    0.88    0.65       0.25
ELM      Err.    2.4     1.53    22.05      3.93
         Std     2.29    1.81    2.18       0.69

Table 15: Comparison between OCReP and RELM

Method   Stat.   Diabetes   Segment
OCReP    Err.    25.53      2.50
         Std.    0.51       0.32
         M       291        760
RELM     Err.    21.81      4.49
         Std.    2.55       0.0074
         M       15         200

Table 16: Comparison between OCReP and TROP-ELM. For Delta Ailerons, average
errors and standard deviations have to be multiplied by 10^-4.

Method     Stat.   Abalone   Delta Ailerons   Machine Cpu   Housing
OCReP      Err.    2.12      1.58             31.22         4.25
           Std.    0.32      0.0048           0.78          0.13
           M       178       298              63            255
TROP-ELM   Err.    2.19      1.64             264.03        34.35
           M       42        80               28            59

² In that work, performance and related statistics are expressed in terms of MSE; we
only derived the corresponding RMSE for comparison with our results.

7. Conclusions 

In the context of regularization techniques for single hidden layer neural
networks trained by pseudoinversion, we provide an optimal value of the
regularization parameter γ by analytic derivation. This is achieved by defining
a convenient regularized matricial formulation in the framework of Singular
Value Decomposition, in which the regularization parameter is derived
under the constraint of condition number minimization. The OCReP method
has been tested on UCI datasets for both regression and classification tasks.
For all cases, regularization implemented using the analytically derived γ is
proven to be very effective in terms of predictivity, as evidenced by
comparison with implementations of other approaches from the literature,
including cross-validation. OCReP avoids the hundreds of pseudoinversions
usually needed by most other methods, i.e. it is quite computationally
attractive.